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Abstract 



Fitting probabilistic models to data is often difficult, due to the general intractabil- 
ity of the partition function and its derivatives. Here we propose a new parameter 
estimation technique that does not require computing an intractable normalization 
factor or sampling from the equilibrium distribution of the model. This is achieved 
by establishing dynamics that would transform the observed data distribution into 
the model distribution, and then setting as the objective the minimization of the 
KL divergence between the data distribution and the distribution produced by run- 
ning the dynamics for an infinitesimal time. Score matching, minimum velocity 
learning, and certain forms of contrastive divergence are shown to be special cases 
of this learning technique. We demonstrate parameter estimation in Ising models, 
deep belief networks and a product of Student-t test model of natural scenes. In 
the Ising model case, current state of the art techniques are outperformed by ap- 
proximately two orders of magnitude in learning time, with comparable error in 
recovered parameters. This technique promises to broaden the class of probabilis- 
tic models that are practical for use with large, complex data sets. 



1 Introduction 



Estimating parameters for probabilistic models is a fundamental problem in many scientific and 
engineering disciplines. Unfortunately, most probabilistic learning techniques require calculating 
the normalization factor, or partition function, of the probabilistic model in question, or at least 
calculating its gradient. For the overwhelming majority of models there are no known analytic 
solutions, confining us to the restrictive subset of probabilistic models that can be solved analytically, 
or those that can be made tractable using approximate learning techniques. Thus, development of 
powerful new techniques for parameter estimation promises to greatly expand the variety of models 
that can be fit to complex data sets. 

Many approaches exist for approximate learning, including mean field theory and its expansions, 
variational Bayes techniques and a plethora of sampling or numerical integration based methods 
||23l|T0l[9j|5l. Of particular interest are contrastive divergence (CD), developed by Hinton, Welling 
and Carreira-Perpinan (241 SI, Hyvarinen's score matching (SM) |7| and the minimum velocity 
learning framework proposed by Movellan (TS] [T4| [161 • 

Contrastive divergence (241 SI is a variation on steepest gradient descent of the maximum (log) 
likelihood (ML) objective function. Rather than integrating over the full model distribution, CD 
approximates the partition function term in the gradient by averaging over the distribution obtained 
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Figure 1 : Dynamics are established which transform any initial distribution into the model distri- 
bution p(^) (0). The dashed red line indicates the family of distributions parametrized by 0, and the 
three successive figures illustrate graphically the progression of learning. Under maximum likeli- 
hood learning, model parameters are chosen so as to minimize the Kullback-Leibler divergence 
between the data distribution p^^^ and the model distribution p(^) (6>). Under minimum probability 
flow learning, however, the KL divergence between p*^^^ and p*^^^ is minimized instead, where p*^^^ 
is the distribution obtained by evolving p*^^^ for infinitesimal time under the dynamics. Here we 
represent graphically how pulling p*^^^ as close as possible to p*^^^ tends to pull p^^\0) close to 
p^^^ as well. 



after taking a few Markov chain Monte Carlo (MCMC) steps away from the data distributior[^ 
Qualitatively, one can imagine that the data distribution is contrasted against a distribution which has 
evolved only a small distance towards the model distribution, whereas it would contrasted against 
the true model distribution in traditional MCMC approaches. Although CD is not guaranteed to 
converge to the right answer, or even to a fixed point, it has proven to be an effective and fast 
heuristic for parameter estimation lfT2ll26ll . 

Score matching, developed by Aapo Hyvarinen Q, is a method that learns parameters in a proba- 
bilistic model using only derivatives of the energy function evaluated over the data distribution (see 
Equation (Vl\). This sidesteps the need to explicitly sample or integrate over the model distribution. 
In score matching one minimizes the expected square distance of the score function with respect to 
spatial coordinates given by the data distribution from the similar score function given by the model 
distribution. A number of connections have been made between score matching and other learning 
techniques (H EH [El El. 

Minimum velocity learning is an approach recently proposed by Movellan fT5\ that recasts a num- 
ber of the ideas behind CD, treating the minimization of the initial dynamics away from the data 
distribution as the goal itself rather than a surrogate for it. Movellan 's proposal is that rather than 
directly minimize the difference between the data and the model, one introduces system dynamics 
that have the model as their equilibrium distribution, and minimizes the initial flow of probability 
away from the data under those dynamics. If the model looks exactly like the data there will be no 
flow of probability, and if model and data are similar the flow of probability will tend to be mini- 
mal. Movellan applies this intuition to the specific case of distributions over continuous state spaces 
evolving via diffusion dynamics, and recovers the score matching objective function. The velocity 



^ The update rule for gradient descent of the negative log likelihood, or maximum likelihood objective 
function, is 

i i 

where p*^°^ and p*^°°^ (0) represent the data distribution and model distribution, respectively, E (0) is th e en - 
ergy function associated with the model distribution and i indexes the states of the system (see Section [TTJ. 
The second term in this gradient can be extremely difficult to compute (costing in general an amount of time 
exponential in the dimensionality of the system). Under contrastive divergence p -"^^ (0) is replaced by samples 
only a few Monte Carlo steps away from the data. 
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in minimum velocity learning is the difference in average drift velocities between particles diffusing 
under the model distribution and particles diffusing under the data distribution. 

Here we propose a framework called minimum probability flow learning (MPF), applicable to any 
parametric model, of which minimum velocity, SM and certain forms of CD are all special cases, 
and which is in many situations more powerful than any of these algorithms. We demonstrate that 
learning under this framework is effective and fast in a number of cases: Ising models 111 El, deep 
belief networks | 6 |, and the product of Student-t tests model for natural scenes 1251 . 



2 Minimum probability flow 



Our goal is to find the parameters that cause a probabilistic model to best agree with a set V of 
(assumed iid) observations of the state of a system. We will do this by proposing dynamics that 
guarantee the transformation of the data distribution into the model distribution, and then minimizing 
the KL divergence which results from running those dynamics for a short time e (see Figure [T]). 



2.1 Distributions 



The data distribution is represented by a vector p'^^^ with pf^ the fraction of the observations V 
in state i. The superscript (0) represents time t = under the system dynamics (which will be 



described in more detail in Section 2.2). For example, in a two variable binary system, p*^^^ would 
have four entries representing the fraction of the data in states 00, 01, 10 and 11 (Figure [2]). 

Our goal is to find the parameters 6 that cause a model distribution p*^^^ (6>) to best match the data 

distribution p*^^^ . The superscript (oo) on the model distribution indicates that this is the equilibrium 

distribution reached after running the dynamics for infinite time. Without loss of generality, we 

assume the model distribution is of the form 

(oo) _ exp(-^, {6)) 
V. (0)- , (1) 

where E (^) is referred to as the energy function, and the normalizing factor Z {0) is the partition 
function, 

Z{e) = Y,exp{-E,{ej) (2) 

i 

(here we have set the "temperature" of the system to 1). 



2.2 Dynamics 

Most Monte-Carlo algorithms rely on two core concepts from statistical physics, the first being 
conservation of probability as enforced by the master equation for the evolution of a distribution 
p(^) with time (HI: 

where pf"^ is the time derivative of pf"^ . Transition rates Tij (0), where i ^ j, give the rate at which 
probability flows from a state j into a state i. The first term of Equation ^ captures the flow of 
probability out of other states j into the state i, and the second captures flow out of i into other states 
j. The dependence on results from the requirement that the chosen dynamics cause p^^^ to flow 
to the equilibrium distribution p'^^^(6>). For readability, explicit dependence on will be dropped 
except where necessary. If we choose to set the diagonal elements of F to obey Ta = — Xlj/i ^ji^ 
then we can write the dynamics as 

p(^) = rp(^) (4) 

(see Figure [2]). The unique solution for p*^^^ i^ 

p(^) =exp(rt)p(°^ (5) 



The form chose n for T in Equation (|4|, coupled with the satisfaction of detailed balance and ergodicity 
guarantees that there is a unique eigenvector p*^"^-* of F with eigenvalue zero, and 



2.3 



introduced in section 
that all other eigenvalues of F have negative real parts. 
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Figure 2: Dynamics of minimum probability flow learning. Model dynamics represented by the 
probability flow matrix T {middle) determine how probability flows from the empirical histogram of 
the sample data points (left) to the equilibrium distribution of the model (right) after a sufficiently 
long time. In this example there are only four possible states for the system, which consists of a 
pair of binary variables, and the particular model parameters favor state 10 whereas the data falls on 
other states. 



2.3 Detailed Balance 



The second core concept is detailed balance. 



(6) 



which states that at equilibrium the probability flow from state i into state j equals the probability 
flow from j into i. When satisfied, detailed balance guarantees that the distribution p(^) (0) is a 
fixed point of the dynamics. Sampling in most Monte Carlo methods is performed by choosing T 
consistent with Equation f6[ (and the added requirement of ergodicity), then stochastically running 
the dynamics of Equationjs] Note that there is no need to restrict the dynamics defined by T to those 
of any real physical process, such as diffusion. 

Equation|6]can be written in terms of the model's energy function E {0) by substituting in Equation[T] 

forp(^) (i9): 



r,, exp {-E, {0)) = exp {-Ej {0)) . 



(7) 



r is underconstrained by the above equation. Motivated by symmetry and aesthetics, we choose as 
the form for the non-diagonal entries in F 



Qij exp 



1 



[E, {0) - E, {6)) 



where the connectivity function 



= 9ji= I I 



unconnected states 
connected states 



(8) 



(9) 



determines which states are allowed to directly exchange probability with each otheij^ gij can be 
set such that F is extremely sparse (see Section [23] ). Theoretically, to guarantee convergence to 
the model distribution, the non-zero elements of F must be chosen such that, given sufficient time, 
probability can flow between any pair of states. 



2.4 Objective Function 

Maximum likelihood parameter estimation involves maximizing the likelihood of some observations 
V under a model, or equivalently minimizing the KL divergence between the data distribution p*^^^ 

^The non-zero F may also be sampled from a proposal distribution rather than set via a deterministic 
scheme, in which case gij takes on the role of proposal distribution, and a factor of ( ^ j ^ enters into Vij . 
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and model distribution p'^^^ 



^ML = argmmI)K^(p(«)||p(-)(^) 



(10) 



Rather than running the dynamics for infinite time, we propose to minimize the KL divergence after 
running the dynamics for an infinitesimal time e, 



^MPF = arg min K (0) 

6 

K{e) = Dkl{p^''^\\p^'HO)). 
For small e, Dkl (p^°^ | |p^'^ (0)) can be approximated by a first order Taylor expansion, 



ODkl (p(°)||p(*) (0)) 



t=o 



dt 



(11) 
(12) 

(13) 



Further algebra (see Appendix |a{i reduces K {9) to a measure of the flow of probability, at time 
t = under the dynamics, out of data states X> into non-data states. 



with gradienQ 

dK (6) _ 

de ~ 



e 



dEj {0) dEi {0) 



dO 



dO 



,exp -{Ej{0)-E,{0)) 



g,,exp -{E,{0)-E,{0)) 



(14) 



(16) 



where is the number of observed data points. Note that Equations ([14]) and ([16]) do not depend 
on the partition function Z (6>) or its derivatives. 

K {0) is uniquely zero when p*^^^ and p(^) [0) are exactly equal (although in general K {0) provides 
a lower bound rather than an upper bound on Dkl (p^°^ | |p^^^ {0)) ). In addition, K {0) IS convex 
for all models p*^"^^ (6>) in the exponential family - that is, models whose energy functions E (6>) are 
linear in their parameters 6 1 13] (see Appendix [B]). 



2.5 Tractability 

The vector p*^^^ is typically huge, as is F (e.g., 2^ and 2^ x 2^, respectively, for an A^-bit binary 
system). Naively, this would seem to prohibit evaluation and minimization of the objective function. 
Fortunately, all the elements in p*^^^ not corresponding to observations are zero. This allows us to 
ignore all those Tij for which no data point exists at state j. Additionally, there is a great deal of 
flexibility as far as which elements of g, and thus F, can be set to zero. By populating F so as to 
connect each state to a small fixed number of additional states, the cost of the algorithm in both 
memory and time is 0{\V\), which does not depend on the number of system states, only on the 
number of observed data points. 



2.6 Continuous Systems 

Although we have motivated this technique using systems with a large, but finite, number of states, 
it generalizes in a straightforward manner to continuous systems. The flow matrix F and distribution 



"^The contrastive divergence update rule can be written in the form 

'dEj{0) dE^{0) 



dO 06 



[probability of MCMC step from j i] (15) 



with obvious similarities to the MPF learning gradient. Thus steepest gradient descent under MPF resembles 
CD updates, but with the sampling/rejection step replaced by a weighting factor. Note that this difference in 
form provides MPF with an objective function, and guarantees a unique global minimum when model and data 
distributions agree. 
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Figure 3: A demonstration of rapid fitting of the Ising model by minimum probabiHty flow learning. 
The mean absolute error in the learned model's correlation matrix is shown as a functions of learning 
time for 40 and 100 unit fully connected Ising models. Convergence is reached in about 15 seconds 
for 20, 000 samples from the 40 unit model (left) and in about 1 minute for 100, 000 samples from 
the 100 unit model (right). Details of the 100 unit model can be seen in Figure]?] 



vectors p^^^ transition from being very large to being infinite in size. T can still be chosen to connect 
each state to a small, finite, number of additional states however, and only outgoing probability flow 
from states with data contributes to the objective function, so the cost of learning remains largely 
unchanged. 

In addition, for a particular pattern of connectivity in F this objective function, like Movellan's ifTSi , 
reduces to score matching |7J. Taking the limit of connections between all states within a small 
distance Cx of each other, and then Taylor expanding in Cx, one can show that, up to an overall 
constant and scaling factor 

l\/E{xi) -VEixi) -\/^E{xi) . 



SM 



(17) 



This reproduces the link discovered by Movellan ifTSl between diffusion dynamics over continuous 
spaces and score matching. 



3 Experimental Results 

Matlab code implementing minimum probability flow learning for each of the following cases is 
available upon request. A public toolkit is under construction. 

All minimization was performed using Mark Schmidt's remarkably effective minFunc ifTSl . 



3.1 Ising model 

The Ising model has a long and storied history in physics |3 1 and machine learning |T| and it has 
recently been found to be a surprisingly useful model for networks of neurons in the retina flUEIl- 
The ability to fit Ising models to the activity of large groups of simultaneously recorded neurons is 
of current interest given the increasing number of these types of data sets from the retina, cortex and 
other brain structures. 

We fit an Ising model (fully visible Boltzmann machine) of the form 



p(-)(x;J) 



1 



Z{3) 



exp 



(18) 



to a set of N d-element iid data samples {^x^'^^\i = l...A^} generated via Gibbs sampling from an 
Ising model as described below, where each of the d elements of x is either or 1. Because each 
Xi G {0, 1}, = x^, we can write the energy function as 



£/(x5 J) — ^ ^ JijXiXj -\- ^ ^ Jii 



(19) 
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Figure 4: An example 100 unit Ising model fit using minimum probability flow learning, (left) 
Randomly chosen Gaussian coupling matrix J (top) with variance 0.04 and associated correlation 
matrix C (bottom) for a 100 unit, fully-connected Ising model. The diagonal has been removed from 
the correlation matrix C for increased visibility, (center) The recovered coupling and correlation 
matrices after minimum probability flow learning on 100, 000 samples from the model in the left 
panels, (right) The error in recovery of the coupling and correlation matrices. 



The probability flow matrix T has 2^ x 2^ elements, but for learning we populate it extremely 
sparsely, setting 



9ij — 9ji 



states i and j differ by single bit flip 
otherwise 



(20) 



Figure [3] shows the average error in predicted correlations as a function of learning time for 20, 000 
samples from a 40 unit, fully connected Ising model. The Jij used were graciously provided by 
Broderick and coauthors, and were identical to those used for synthetic data generation in the 2008 
paper "Faster solutions of the inverse pairwise Ising problem" 0. Training was performed on 
20, 000 samples so as to match the number of samples used in section III. A. of Broderick et al. 
Note that given sufficient samples, the minimum probability flow algorithm would converge exactly 
to the right answer, as learning in the Ising model is convex (see Appendix [B]), and has its global 
minimum at the true solution. On an 8 core 2.33 GHz Intel Xeon, the learning converges in about 
15 seconds. Broderick et al. perform a similar learning task on a 100-CPU grid computing cluster, 
with a convergence time of approximately 200 seconds. 

Similar learning was performed for 100, 000 samples from a 100 unit, fully connected Ising model. 
A coupling matrix was chosen with elements randomly drawn from a Gaussian with mean and 
variance 0.04. Using the minimum probability flow learning technique, learning took approximately 
1 minute, compared to roughly 12 hours for a 100 unit (nearest neighbor coupling only) model of 
retinal data (201 (personal communication, J. Shlens). Figure |4] demonstrates the recovery of the 
coupling and correlation matrices for our fully connected Ising model, while Figure [3] shows the 
time course for learning. 



3.2 Deep Belief Network 



As a demonstration of learning on a more complex discrete valued model, we trained a 4 layer 
deep belief network (DBN) | 6 1 on MNIST handwritten digits. A DBN consists of stacked restricted 
Boltzmann machines (RBMs), such that the hidden layer of one RBM forms the visible layer of the 
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Figure 5: A deep belief network trained using minimum probability flow learning and contrastive 
divergence, (left) A four layer deep belief network was trained on the MNIST postal hand written 
digits dataset. (center) Confabulations after training via minimum probability flow learning. A 
reasonable probabilistic model for handwritten digits has been learned, (right) Confabulations after 
training via single step CD. Note the uneven distribution of digit occurrences. 




Figure 6: A continuous state space model fit using minimum probability flow learning. The sparsity 
parameter a (left) and receptive fields J (right) are shown for a product of Student-t model trained 
on natural images. Both plots are ordered by sparsity. 



next. Each RBM has the form: 



p(^)(xvis,Xhid;W) = 



p(-)(xvis;W) = 



Z(W) 



Z(W) 



exp 



exp 



5: log 



^vis,i'^hid, 



exp 



(21) 



(22) 



Note that sampling-free application of the minimum probability flow algorithm requires analytically 
marginalizing over the hidden units. RBMs were trained in sequence, starting at the bottom layer, 
on 10,000 samples from the MNIST postal hand written digits data set. As in the Ising case, the 
probability flow matrix F was populated so as to connect every state to all states which differed by 
only a single bit flip (see Equation [20]). Training was performed by both minimum probability flow 
and single step CD to allow a simple comparison of the two techniques (note that CD turns into full 
ML learning as the number of steps is increased, and that the quality of the CD answer can thus be 
improved at the cost of computational time by using many- step CD). 

Confabulations were performed by Gibbs sampling from the top layer RBM, then propagating each 
sample back down to the pixel layer by way of the conditional distribution p*^^^(xvis|xhid; W^) 
for each of the intermediary RBMs, where k indexes the layer in the stack. As shown in Figure |5| 
minimum probability flow learned a good model of handwritten digits. 

3.3 Product of Student-t Model 



As a demonstration of parameter estimation in continuous state space, non-analytically normaliz- 
able, probabilistic models, we trained a product of Student-t model 1251 with 100 receptive fields J 



8 



on 100, 000 10 X 10 natural image patches. The product of Student-t model has the form 

p(^) (x; J, a) oc e- iog[i+(j.x)2] ^ ^^3) 

where the a parameter controls the sparsity of the distribution. Qij was chosen such that each state 
X was connected to 2 states x chosen by adding Gaussian noise to x and rescaling to maintain patch 
variance — that is x = (x + n) for Gaussian noise vector n ~ (0, 0.1). Figure j6j shows 

that this model learns oriented edges, as expected. 

4 Summary 

We have presented a novel framework for efficient learning in the context of any parametric model. 
This method was inspired by the minimum velocity approach developed by Movellan, and it reduces 
to that technique as well as to score matching and some forms of contrastive divergence under 
suitable choices for the dynamics and state space. By decoupling the dynamics from any specific 
physical process, such as diffusion, and focusing on the initial flow of probability from the data to a 
subset of other states chosen in part for their utility and convenience, we have arrived at a framework 
that is not only more general than previous approaches, but also potentially much more powerful. 
We expect that this framework will render some previously intractable models more amenable to 
analysis. 
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APPENDICES 



A Taylor Expansion of KL Divergence 



ODkl (p(°)||p(*) (0)) 



t=0 



dt 



t=0 



= e 



+ e 

-I 

dt ' 



ODkl (p^o^IIp^*) (0)) 



dt 



t=o 



^pT' log 



(0)' 



iev 



it) 



p\ dp. 



it) 



iev Pi 



(0) dt 



E 



dpi 



(t) 



dt 



d_ 
di 



= ^E 



i-E^'^ 



it) 



dt 







(0) 

pT E E^^^' 



where we used the fact that ^i^x,pf^ + ^i^vPf^ ~ ^- "^^^^ implies that the rate of growth of 
the KL divergence at time t = equals the total initial flow of probability from states with data into 
states without. 



(A-1) 
(A-2) 
(A-3) 

(A-4) 

(A-5) 

(A-6) 

(A-7) 

(A-8) 
(A-9) 
(A-10) 



B Convexity 

As observed by Macke and Gerwinn LI 3 J . the MPF objective function is convex for models in the 
exponential family. 



We wish to minimize 



K has derivative 



^ = E E (B-i) 
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and Hessian 




(B-4) 



The first term is a weighted sum of outer products, with non-negative weights \rijpi^\ and is thus 
positive semidefinite. The second term is for models in the exponential family (those with energy 
functions Hnear in their parameters). 

Parameter estimation for models in the exponential family is therefore convex using minimum prob- 
ability flow learning. 
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